fix(bootstrap): detect missing sandbox supervisor binary during gateway health check #281

Merged
drew merged 5 commits into main from fix-missing-supervisor-healthcheck/an on Mar 13, 2026

@drew drew commented Mar 13, 2026

Summary

  • Add HEALTHCHECK_MISSING_SUPERVISOR check to cluster-healthcheck.sh that verifies /opt/openshell/bin/openshell-sandbox exists and is executable
  • Add early detection of the marker in the bootstrap polling loop (runtime.rs) so gateway start fails fast with actionable guidance instead of timing out after 6 minutes
  • Add structured error diagnosis in errors.rs with recovery steps (rebuild image, recreate gateway)
  • Update debug-openshell-cluster skill with the new failure pattern and diagnostic command
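The fail-fast behaviour in the second bullet can be sketched as below. This is a minimal illustration, not the code in runtime.rs: only the marker name `HEALTHCHECK_MISSING_SUPERVISOR` comes from this PR, while `PollOutcome`, `classify_health_output`, `wait_for_gateway`, and the shape of the health check output are hypothetical.

```rust
/// Marker named in the PR summary. How cluster-healthcheck.sh prints it
/// (prefix, trailing text) is an assumption of this sketch.
const MISSING_SUPERVISOR_MARKER: &str = "HEALTHCHECK_MISSING_SUPERVISOR";

/// Hypothetical result of inspecting one health check snapshot.
#[derive(Debug, PartialEq)]
enum PollOutcome {
    Healthy,
    KeepWaiting,
    MissingSupervisor,
}

/// Classify one snapshot. The marker takes priority over the healthy flag,
/// because a gateway without the supervisor binary cannot start sandboxes.
fn classify_health_output(output: &str, healthy: bool) -> PollOutcome {
    if output.lines().any(|line| line.contains(MISSING_SUPERVISOR_MARKER)) {
        PollOutcome::MissingSupervisor
    } else if healthy {
        PollOutcome::Healthy
    } else {
        PollOutcome::KeepWaiting
    }
}

/// Walk through at most `max_polls` snapshots. Returns the 1-based poll number
/// on which the gateway became healthy, or an error as soon as the marker shows
/// up, instead of waiting for the full polling budget to run out.
fn wait_for_gateway<I>(snapshots: I, max_polls: usize) -> Result<usize, String>
where
    I: IntoIterator<Item = (String, bool)>,
{
    for (i, (output, healthy)) in snapshots.into_iter().take(max_polls).enumerate() {
        match classify_health_output(&output, healthy) {
            PollOutcome::Healthy => return Ok(i + 1),
            PollOutcome::MissingSupervisor => {
                return Err(format!(
                    "sandbox supervisor binary missing (poll {}): rebuild the image and recreate the gateway",
                    i + 1
                ));
            }
            PollOutcome::KeepWaiting => {}
        }
    }
    Err("timed out waiting for gateway".to_string())
}
```

In this sketch the snapshots are supplied by the caller, so the loop can be exercised without a running container; a real polling loop would sleep between polls and read the output from the gateway container.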

Context

When the published cluster image is missing the sandbox supervisor binary (e.g. built before the supervisor-builder stage was added), the gateway reports healthy but every sandbox pod crashes immediately with:

exec: "/opt/openshell/bin/openshell-sandbox": stat /opt/openshell/bin/openshell-sandbox: no such file or directory

This failure is confusing: the gateway health check passes and openshell status shows the server is up, yet no sandbox can start. The fix makes the health check catch this condition and has the bootstrap surface a clear error with recovery instructions.
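For illustration, the condition being checked ("exists and is executable") can be written in Rust as follows. The actual check lives in cluster-healthcheck.sh as a shell test; `is_executable_file` is a hypothetical helper, and unlike a shell `-x` test it only looks at whether any execute bit is set, not whether the current user may execute the file.

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Returns true when `path` is a regular file with at least one execute bit set.
/// A missing path, a directory, or a metadata error all count as "not executable".
/// Unix-only: relies on the POSIX permission mode.
fn is_executable_file(path: &Path) -> bool {
    match fs::metadata(path) {
        Ok(meta) => meta.is_file() && (meta.permissions().mode() & 0o111) != 0,
        Err(_) => false,
    }
}
```

A caller would pass the supervisor path from this PR, for example `is_executable_file(Path::new("/opt/openshell/bin/openshell-sandbox"))`, and emit the marker when it returns false.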

Test Plan

  • cargo check -p openshell-bootstrap passes
  • All 69 unit tests in openshell-bootstrap pass
  • mise run pre-commit passes

@drew drew self-assigned this Mar 13, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Mar 13, 2026
drew added 5 commits March 13, 2026 14:05
…ay health check

The cluster health check now verifies that /opt/openshell/bin/openshell-sandbox
exists in the gateway container. Without this binary, every sandbox pod crashes
with 'no such file or directory' but the gateway previously reported healthy.

- Add HEALTHCHECK_MISSING_SUPERVISOR marker to cluster-healthcheck.sh
- Add early detection in bootstrap polling loop (runtime.rs)
- Add structured error diagnosis with recovery steps (errors.rs)
- Update debug-openshell-cluster skill with new failure pattern
Large sandbox images (e.g. 852MB base image) can take ~3 minutes to
pull, exceeding the previous 120s timeout. Bump to 300s across all
surfaces: CLI watch loop, server orphan grace period, TUI ready poll,
Python SDK wait_ready default, and E2E test harness.
… network

Teardown of existing gateway resources (container, volume, image) is now
performed inside deploy_gateway_with_logs() so both 'gateway start' and
the auto-bootstrap path in 'sandbox create' get identical cleanup.

Remove the dedicated 'openshell-cluster' bridge network — it was only
used by a single container and the default Docker bridge is sufficient.
This eliminates the network create/remove retry loops and stale endpoint
cleanup that added complexity without benefit.

Fix the provisioning stream timeout in run.rs: wrap stream.next() in
tokio::time::timeout() so the 300s deadline fires even when the gRPC
stream stops producing events.
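The stalled-stream problem can be reproduced without gRPC. The sketch below is a standard-library analogue, not the code in run.rs: it swaps `tokio::time::timeout()` around `stream.next()` for `Receiver::recv_timeout` on an `std::sync::mpsc` channel, and `wait_until_ready` and `WaitError` are hypothetical names. The point it illustrates is the same: each wait for the next event is bounded by the time remaining until the overall deadline, so a silent producer cannot block forever.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum WaitError {
    /// The overall deadline passed before a ready event arrived.
    DeadlineExceeded,
    /// The producer went away without ever sending a ready event.
    StreamClosed,
}

/// Consume events until `is_ready` accepts one, enforcing one overall deadline.
/// Checking the clock only after each event would never fire if the producer
/// simply stopped sending, so every receive is bounded by the remaining time.
fn wait_until_ready<T>(
    events: &Receiver<T>,
    deadline: Duration,
    is_ready: impl Fn(&T) -> bool,
) -> Result<T, WaitError> {
    let start = Instant::now();
    loop {
        // Time left until the deadline; None means it has already passed.
        let remaining = deadline
            .checked_sub(start.elapsed())
            .ok_or(WaitError::DeadlineExceeded)?;
        match events.recv_timeout(remaining) {
            Ok(event) if is_ready(&event) => return Ok(event),
            Ok(_) => continue, // progress event: keep waiting
            Err(RecvTimeoutError::Timeout) => return Err(WaitError::DeadlineExceeded),
            Err(RecvTimeoutError::Disconnected) => return Err(WaitError::StreamClosed),
        }
    }
}
```

In the async version described in the commit message, the same structure would apply with the receive call replaced by a timeout-wrapped `stream.next()`.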
… gateway

Instead of unconditionally tearing down existing gateway resources on
every `gateway start`, restore the interactive prompt that asks the
user whether to destroy and recreate. When --recreate is passed, the
prompt is skipped and resources are destroyed directly. In
non-interactive mode, the existing gateway is reused silently.

The deploy_gateway_with_logs function now respects a `recreate` field
on DeployOptions — only destroying Docker resources (container, image,
volume) when explicitly requested. The auto-bootstrap path sets
recreate=true to handle stale Docker resources without metadata.
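The decision described in this commit message can be summarised as a small pure function. This is a sketch of the stated rules only; `ExistingGatewayAction` and `decide_gateway_action` are hypothetical names, and the real logic is spread across the CLI and `deploy_gateway_with_logs`.

```rust
/// Hypothetical summary of what `gateway start` does about existing resources.
#[derive(Debug, PartialEq)]
enum ExistingGatewayAction {
    /// Nothing exists yet: deploy normally.
    FreshDeploy,
    /// Keep the existing gateway as it is (non-interactive default).
    Reuse,
    /// Ask the user whether to destroy and recreate.
    PromptUser,
    /// Destroy container, image, and volume, then redeploy.
    DestroyAndRecreate,
}

/// `recreate` mirrors the --recreate flag (or recreate=true on the
/// auto-bootstrap path); `interactive` is true when a prompt can be shown.
fn decide_gateway_action(
    gateway_exists: bool,
    recreate: bool,
    interactive: bool,
) -> ExistingGatewayAction {
    if recreate {
        // An explicit request wins and skips the prompt. It also applies when
        // no gateway metadata exists, to clear stale Docker resources.
        ExistingGatewayAction::DestroyAndRecreate
    } else if !gateway_exists {
        ExistingGatewayAction::FreshDeploy
    } else if interactive {
        ExistingGatewayAction::PromptUser
    } else {
        ExistingGatewayAction::Reuse
    }
}
```

Keeping the rule in one function like this makes the three documented cases (prompt, --recreate, non-interactive reuse) easy to cover with unit tests.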
@drew drew force-pushed the fix-missing-supervisor-healthcheck/an branch from e37ba99 to a27aaf4 Compare March 13, 2026 21:08
@drew drew merged commit 37c9ae7 into main Mar 13, 2026
11 of 12 checks passed
@drew drew deleted the fix-missing-supervisor-healthcheck/an branch March 13, 2026 22:07
drew added a commit that referenced this pull request Mar 14, 2026
PR #281 removed the shared openshell-cluster Docker network in favor of
the default bridge. This restores custom bridge networking but makes each
gateway use its own isolated network named openshell-cluster-{name},
matching the existing container/volume naming convention.

Changes:
- Add network_name() to constants.rs for per-gateway network naming
- Add ensure_network() with retry/backoff and force_remove_network()
  parameterized by network name instead of a global constant
- Attach containers to their per-gateway network via network_mode
- Disconnect and remove the network during gateway destroy
- Wire ensure_network() into the deploy flow before ensure_volume()
- Update architecture docs to reflect per-gateway network isolation
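A rough sketch of the two building blocks named in the list above. Only the `openshell-cluster-{name}` naming pattern comes from the commit message; the signature of `network_name`, the `retry_with_backoff` helper, and the doubling backoff policy are assumptions, and the real `ensure_network()` talks to Docker rather than to a closure.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Per-gateway network name, following the openshell-cluster-{name} pattern.
fn network_name(gateway: &str) -> String {
    format!("openshell-cluster-{}", gateway)
}

/// Hypothetical retry helper: run `op` up to `max_attempts` times (at least
/// once), doubling the delay after each failure. `op` receives the 1-based
/// attempt number. The last error is returned if every attempt fails.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}
```

An `ensure_network()` built on this would pass a closure that tries to create the network named by `network_name(gateway)` and returns an error while a stale endpoint is still being cleaned up.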
drew added a commit that referenced this pull request Mar 16, 2026
PR #281 removed the shared openshell-cluster Docker network in favor of
the default bridge. This restores custom bridge networking but makes each
gateway use its own isolated network named openshell-cluster-{name},
matching the existing container/volume naming convention.

Changes:
- Add network_name() to constants.rs for per-gateway network naming
- Add ensure_network() with retry/backoff and force_remove_network()
  parameterized by network name instead of a global constant
- Attach containers to their per-gateway network via network_mode
- Disconnect and remove the network during gateway destroy
- Wire ensure_network() into the deploy flow before ensure_volume()
- Update architecture docs to reflect per-gateway network isolation